1 Recap

So just to briefly recap what we looked at last week:

  • Matrices - 2-dimensional structures containing a single type of data
  • Arrays - n-dimensional matrices containing a single type of data
  • data.frames (including tibbles) - can contain multiple data types
  • Lists - structures containing nested information of multiple types
  • Installing packages from CRAN via install.packages("package.name")
  • Installing packages from GitHub via install_github("r-lib/vroom")
  • Specifying which package a function comes from using the :: operator, e.g. vroom::vroom()
  • Loading data into R using vroom() via an absolute or a relative pathway
  • Loading data from GitHub using vroom() and the data's URL
  • Creating pipelines via the magrittr operator %>%
  • Reshaping data from wide to long and long to wide using the pivot_ functions

Today we are going to use the tidyverse to delve further into some data manipulation, and then learn how to visualise these data too. But first…

1.1 Homework solution

At the end of the last session I set homework to join and reshape the two data frames into a single long-format tibble, making sure you have: (1) a column specifying which population the data are from (i.e. population 1 or 2), (2) a column specifying the date the data were collected, (3) a column with the population abundance estimates in it, and (4) that any missing values are dropped from the data. And to do this with the minimum code! Here is the solution I came up with - see if you managed to do better:

##read in the data
wide_spp.1 <- vroom("https://raw.githubusercontent.com/chrit88/Bioinformatics_data/master/Workshop%203/to_sort_pop_1.csv")
wide_spp.2 <- vroom("https://raw.githubusercontent.com/chrit88/Bioinformatics_data/master/Workshop%203/to_sort_pop_2.csv") 

## code to reshape data
## first join the data using full join - this will keep all of the columns
long_spp <- full_join(wide_spp.1, wide_spp.2) %>%
              ## pivot the joined data frame, using species, primary_threat,
              ## secondary_threat, and tertiary_threat as ID columns,
              ## using names_pattern to pull out the population number 
              ## and make a new column (called population) to store it in,
              ## and dropping the NAs
               pivot_longer(cols = -c(species, 
                                      primary_threat, 
                                      secondary_threat, 
                                      tertiary_threat), 
                            names_to = c("population", "date"),
                            names_pattern = "(.*)_(.*)",
                            values_drop_na = TRUE, 
                            values_to = "abundance")

As we will see later there are some other issues with these data, but for the moment we can leave it at that.

2 More data manipulation

Last week we started to think about how to manipulate data - shaping it into the form we want and need to make analysis easier. This is what you practiced in your homework, and we are going to build on it today using the COVID-19 data from last week.


Task

Load the 'tidyverse' and 'vroom' packages, and then load in the covid data from: https://raw.githubusercontent.com/chrit88/Bioinformatics_data/master/Workshop%203/time_series_covid19_deaths_global.csv


covid_dat <- vroom("https://raw.githubusercontent.com/chrit88/Bioinformatics_data/master/Workshop%203/time_series_covid19_deaths_global.csv")

So if we remember, this data is in wide format, with the first 4 columns specifying the province/state, country/region, latitude, and longitude, and the rest specifying the number of deaths per day. Let's rearrange that to long format before we go any further. First we need to rename the columns (remind yourself of why in last week's workbook).


Task

Rename the first two columns in the covid data frame to “Province.State” and “Country.Region”.


##change the first two names of our data frame
names(covid_dat)[1:2] <- c("Province.State", "Country.Region")

Then we need to reshape the data into long format:

##Take the covid data 
covid_long <- covid_dat %>%
                ##and then apply this function 
                pivot_longer(cols = -c(Province.State:Long),
                             names_to = "Date",
                             values_to = "Deaths")

Great, now we can move on to some more advanced data manipulations. The tidyverse's data manipulations can be broken down into 5 common themes:

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.

which can be combined with group_by() to specify the groups within the data we want to apply these to. Let's delve in.
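Each of these verbs takes a data frame and returns a modified one, so they chain together naturally with %>%. Here is a minimal sketch of all five verbs plus group_by() on a made-up tibble (the data and column names below are invented purely for illustration):

```r
library(dplyr)

## an invented tibble purely for illustration
df <- tibble(site  = c("A", "A", "B", "B"),
             count = c(10, 20, 5, 15))

summary_df <- df %>%
  mutate(doubled = count * 2) %>%       ## mutate: add a new variable
  select(site, doubled) %>%             ## select: keep only some columns
  filter(doubled > 10) %>%              ## filter: keep only some rows
  group_by(site) %>%                    ## group_by: define the groups
  summarise(total = sum(doubled)) %>%   ## summarise: one row per group
  arrange(desc(total))                  ## arrange: order the rows

summary_df
```

Reading the pipeline top to bottom tells you exactly what happens to the data at each step, which is a big part of why this style is so readable.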

First off, let's consider the data we have. Useful things we might want to know are:

  1. the total number of deaths (globally)
  2. the number of deaths per day (globally)
  3. the number of deaths per million people in each country

So to calculate things like the number of deaths per million people we need to know the number of people who live in each country - the covid data doesn't include this (it's raw counts), so we need to go fishing for it.

2.1 Getting country population data

Luckily the World Bank (WB) does keep this sort of data (along with economic data on countries, which we can dive into later), and even more luckily someone has built an R package to allow us to easily access the World Bank data: wbstats.


Task

Install wbstats (best to install it from its GitHub repository - try googling it) and load it into R, then find and have a look at its vignette online (Google will help you here too).


So, the vignette gives us the code we need to pull out the population data:

##extract the population data for all countries
pop_data <- wb_data(indicator = "SP.POP.TOTL", 
               start_date = 2002, 
               end_date = 2020)


##convert it to a tibble
pop_data <- as_tibble(pop_data)

Task

In the vignette example they work with data up to and including the year 2002; here we have requested data up to 2020. Can you work out what the most recent year of data available in pop_data is?






We can check what the most recently available data are using the max() function:

##the maximum value of the years in the date column
max(pop_data$date)
## [1] 2020

2.2 Filtering data

So we know that we just want to use the latest available population counts for each country. We can filter these results out of the full pop_data using the filter() function. filter() works with the logical operators we discussed in the first workshop (e.g. ==, >, >=), as well as functions we haven't encountered before (like between()) and ones we have (e.g. mean()) - see here for some examples. As we want the data from the most recent available year, we can use this as our filtering argument:
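As an aside, between() is just an inclusive shorthand for a pair of >= and <= comparisons. A quick sketch on an invented tibble (the data below are made up for illustration):

```r
library(dplyr)

## an invented tibble of yearly values, purely for illustration
yrs <- tibble(year  = 2015:2020,
              value = 1:6)

## keep the rows where 2017 <= year <= 2019 (both ends inclusive)
yrs %>% filter(between(year, 2017, 2019))
```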

## filter the data to include the most recent year only:
pop_2019 <- pop_data %>% 
              ##only return data where the date is equal to the maximum value in the column "date"
              filter(date == max(pop_data$date))

##look at the data
pop_2019
## # A tibble: 217 x 9
##    iso2c iso3c country   date SP.POP.TOTL unit  obs_status footnote last_updated
##    <chr> <chr> <chr>    <dbl>       <dbl> <chr> <chr>      <chr>    <date>      
##  1 AF    AFG   Afghani…  2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  2 AL    ALB   Albania   2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  3 DZ    DZA   Algeria   2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  4 AS    ASM   America…  2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  5 AD    AND   Andorra   2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  6 AO    AGO   Angola    2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  7 AG    ATG   Antigua…  2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  8 AR    ARG   Argenti…  2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  9 AM    ARM   Armenia   2020          NA <NA>  <NA>       <NA>     2020-10-15  
## 10 AW    ABW   Aruba     2020          NA <NA>  <NA>       <NA>     2020-10-15  
## # … with 207 more rows

You can see from this that we have data not only on true “countries” but also economic regions, as defined by the WB. To be able to use these data we need to merge them into the covid data.

2.3 Cleaning data

Once we join these two data frames we can use the information from the WB to calculate some statistics in the covid data (like deaths per million people). However, before we think about making this join we need to do some careful consideration of whether we can actually join the two data sets in their current form.

Have a look at the structure of the two data sets:

##the covid data
covid_long
## # A tibble: 46,816 x 6
##    Province.State Country.Region   Lat  Long Date    Deaths
##    <chr>          <chr>          <dbl> <dbl> <chr>    <dbl>
##  1 <NA>           Afghanistan     33.9  67.7 1/22/20      0
##  2 <NA>           Afghanistan     33.9  67.7 1/23/20      0
##  3 <NA>           Afghanistan     33.9  67.7 1/24/20      0
##  4 <NA>           Afghanistan     33.9  67.7 1/25/20      0
##  5 <NA>           Afghanistan     33.9  67.7 1/26/20      0
##  6 <NA>           Afghanistan     33.9  67.7 1/27/20      0
##  7 <NA>           Afghanistan     33.9  67.7 1/28/20      0
##  8 <NA>           Afghanistan     33.9  67.7 1/29/20      0
##  9 <NA>           Afghanistan     33.9  67.7 1/30/20      0
## 10 <NA>           Afghanistan     33.9  67.7 1/31/20      0
## # … with 46,806 more rows
##the population data
pop_2019
## # A tibble: 217 x 9
##    iso2c iso3c country   date SP.POP.TOTL unit  obs_status footnote last_updated
##    <chr> <chr> <chr>    <dbl>       <dbl> <chr> <chr>      <chr>    <date>      
##  1 AF    AFG   Afghani…  2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  2 AL    ALB   Albania   2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  3 DZ    DZA   Algeria   2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  4 AS    ASM   America…  2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  5 AD    AND   Andorra   2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  6 AO    AGO   Angola    2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  7 AG    ATG   Antigua…  2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  8 AR    ARG   Argenti…  2020          NA <NA>  <NA>       <NA>     2020-10-15  
##  9 AM    ARM   Armenia   2020          NA <NA>  <NA>       <NA>     2020-10-15  
## 10 AW    ABW   Aruba     2020          NA <NA>  <NA>       <NA>     2020-10-15  
## # … with 207 more rows

Question

Are there any issues you can see if we try and join these two data sets?










Well, we are going to want to join these data.frames using the information in the country columns (remember we want to take the total population data for each country from the WB data and add it as a new column to the covid data). We can see that in the WB data we have countries AND economic areas. You can see this more clearly if we look at the unique values in the “country” column:

##the first 10 and last 10 unique values in the country column
## the ; operator acts as a new line - meaning you can run two bits of code which don't interact on the same line
head(unique(pop_2019$country), 10); tail(unique(pop_2019$country), 10)
##  [1] "Afghanistan"         "Albania"             "Algeria"            
##  [4] "American Samoa"      "Andorra"             "Angola"             
##  [7] "Antigua and Barbuda" "Argentina"           "Armenia"            
## [10] "Aruba"
##  [1] "Uruguay"               "Uzbekistan"            "Vanuatu"              
##  [4] "Venezuela, RB"         "Vietnam"               "Virgin Islands (U.S.)"
##  [7] "West Bank and Gaza"    "Yemen, Rep."           "Zambia"               
## [10] "Zimbabwe"

This isn’t a big problem in itself; we can just drop the values for “countries” in the pop_2019 data set which don’t match the countries we have in the covid data.

There are however two slightly bigger issues which will hopefully become apparent if you look at the structure of the original covid data:

##the covid data
head(covid_dat, 10)
## # A tibble: 10 x 180
##    Province.State Country.Region   Lat   Long `1/22/20` `1/23/20` `1/24/20`
##    <chr>          <chr>          <dbl>  <dbl>     <dbl>     <dbl>     <dbl>
##  1 <NA>           Afghanistan     33.9  67.7          0         0         0
##  2 <NA>           Albania         41.2  20.2          0         0         0
##  3 <NA>           Algeria         28.0   1.66         0         0         0
##  4 <NA>           Andorra         42.5   1.52         0         0         0
##  5 <NA>           Angola         -11.2  17.9          0         0         0
##  6 <NA>           Antigua and B…  17.1 -61.8          0         0         0
##  7 <NA>           Argentina      -38.4 -63.6          0         0         0
##  8 <NA>           Armenia         40.1  45.0          0         0         0
##  9 Australian Ca… Australia      -35.5 149.           0         0         0
## 10 New South Wal… Australia      -33.9 151.           0         0         0
## # … with 173 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>,
## #   `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>,
## #   `02/01/2020` <dbl>, `02/02/2020` <dbl>, `02/03/2020` <dbl>,
## #   `02/04/2020` <dbl>, `02/05/2020` <dbl>, `02/06/2020` <dbl>,
## #   `02/07/2020` <dbl>, `02/08/2020` <dbl>, `02/09/2020` <dbl>,
## #   `02/10/2020` <dbl>, `02/11/2020` <dbl>, `02/12/2020` <dbl>,
## #   `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>, `2/16/20` <dbl>,
## #   `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>,
## #   `2/21/20` <dbl>, `2/22/20` <dbl>, `2/23/20` <dbl>, `2/24/20` <dbl>,
## #   `2/25/20` <dbl>, `2/26/20` <dbl>, `2/27/20` <dbl>, `2/28/20` <dbl>,
## #   `2/29/20` <dbl>, `03/01/2020` <dbl>, `03/02/2020` <dbl>,
## #   `03/03/2020` <dbl>, `03/04/2020` <dbl>, `03/05/2020` <dbl>,
## #   `03/06/2020` <dbl>, `03/07/2020` <dbl>, `03/08/2020` <dbl>,
## #   `03/09/2020` <dbl>, `03/10/2020` <dbl>, `03/11/2020` <dbl>,
## #   `03/12/2020` <dbl>, `3/13/20` <dbl>, `3/14/20` <dbl>, `3/15/20` <dbl>,
## #   `3/16/20` <dbl>, `3/17/20` <dbl>, `3/18/20` <dbl>, `3/19/20` <dbl>,
## #   `3/20/20` <dbl>, `3/21/20` <dbl>, `3/22/20` <dbl>, `3/23/20` <dbl>,
## #   `3/24/20` <dbl>, `3/25/20` <dbl>, `3/26/20` <dbl>, `3/27/20` <dbl>,
## #   `3/28/20` <dbl>, `3/29/20` <dbl>, `3/30/20` <dbl>, `3/31/20` <dbl>,
## #   `04/01/2020` <dbl>, `04/02/2020` <dbl>, `04/03/2020` <dbl>,
## #   `04/04/2020` <dbl>, `04/05/2020` <dbl>, `04/06/2020` <dbl>,
## #   `04/07/2020` <dbl>, `04/08/2020` <dbl>, `04/09/2020` <dbl>,
## #   `04/10/2020` <dbl>, `04/11/2020` <dbl>, `04/12/2020` <dbl>,
## #   `4/13/20` <dbl>, `4/14/20` <dbl>, `4/15/20` <dbl>, `4/16/20` <dbl>,
## #   `4/17/20` <dbl>, `4/18/20` <dbl>, `4/19/20` <dbl>, `4/20/20` <dbl>,
## #   `4/21/20` <dbl>, `4/22/20` <dbl>, `4/23/20` <dbl>, `4/24/20` <dbl>,
## #   `4/25/20` <dbl>, `4/26/20` <dbl>, `4/27/20` <dbl>, `4/28/20` <dbl>,
## #   `4/29/20` <dbl>, `4/30/20` <dbl>, `05/01/2020` <dbl>, `05/02/2020` <dbl>,
## #   `05/03/2020` <dbl>, …

The first is the names of the countries - we will only be able to join the data sets when the country names in each of the data sets match exactly. We will try and solve that later.

The other major issue is the Province.State column in the covid data. Let's have a closer look at the Australia data:

## just look at the data from Australia:
covid_dat %>% filter(Country.Region == "Australia")
## # A tibble: 8 x 180
##   Province.State Country.Region   Lat  Long `1/22/20` `1/23/20` `1/24/20`
##   <chr>          <chr>          <dbl> <dbl>     <dbl>     <dbl>     <dbl>
## 1 Australian Ca… Australia      -35.5  149.         0         0         0
## 2 New South Wal… Australia      -33.9  151.         0         0         0
## 3 Northern Terr… Australia      -12.5  131.         0         0         0
## 4 Queensland     Australia      -27.5  153.         0         0         0
## 5 South Austral… Australia      -34.9  139.         0         0         0
## 6 Tasmania       Australia      -42.9  147.         0         0         0
## 7 Victoria       Australia      -37.8  145.         0         0         0
## 8 Western Austr… Australia      -32.0  116.         0         0         0
## # … with 173 more variables: `1/25/20` <dbl>, `1/26/20` <dbl>, `1/27/20` <dbl>,
## #   `1/28/20` <dbl>, `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>,
## #   `02/01/2020` <dbl>, `02/02/2020` <dbl>, `02/03/2020` <dbl>,
## #   `02/04/2020` <dbl>, `02/05/2020` <dbl>, `02/06/2020` <dbl>,
## #   `02/07/2020` <dbl>, `02/08/2020` <dbl>, `02/09/2020` <dbl>,
## #   `02/10/2020` <dbl>, `02/11/2020` <dbl>, `02/12/2020` <dbl>,
## #   `2/13/20` <dbl>, `2/14/20` <dbl>, `2/15/20` <dbl>, `2/16/20` <dbl>,
## #   `2/17/20` <dbl>, `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>,
## #   `2/21/20` <dbl>, `2/22/20` <dbl>, `2/23/20` <dbl>, `2/24/20` <dbl>,
## #   `2/25/20` <dbl>, `2/26/20` <dbl>, `2/27/20` <dbl>, `2/28/20` <dbl>,
## #   `2/29/20` <dbl>, `03/01/2020` <dbl>, `03/02/2020` <dbl>,
## #   `03/03/2020` <dbl>, `03/04/2020` <dbl>, `03/05/2020` <dbl>,
## #   `03/06/2020` <dbl>, `03/07/2020` <dbl>, `03/08/2020` <dbl>,
## #   `03/09/2020` <dbl>, `03/10/2020` <dbl>, `03/11/2020` <dbl>,
## #   `03/12/2020` <dbl>, `3/13/20` <dbl>, `3/14/20` <dbl>, `3/15/20` <dbl>,
## #   `3/16/20` <dbl>, `3/17/20` <dbl>, `3/18/20` <dbl>, `3/19/20` <dbl>,
## #   `3/20/20` <dbl>, `3/21/20` <dbl>, `3/22/20` <dbl>, `3/23/20` <dbl>,
## #   `3/24/20` <dbl>, `3/25/20` <dbl>, `3/26/20` <dbl>, `3/27/20` <dbl>,
## #   `3/28/20` <dbl>, `3/29/20` <dbl>, `3/30/20` <dbl>, `3/31/20` <dbl>,
## #   `04/01/2020` <dbl>, `04/02/2020` <dbl>, `04/03/2020` <dbl>,
## #   `04/04/2020` <dbl>, `04/05/2020` <dbl>, `04/06/2020` <dbl>,
## #   `04/07/2020` <dbl>, `04/08/2020` <dbl>, `04/09/2020` <dbl>,
## #   `04/10/2020` <dbl>, `04/11/2020` <dbl>, `04/12/2020` <dbl>,
## #   `4/13/20` <dbl>, `4/14/20` <dbl>, `4/15/20` <dbl>, `4/16/20` <dbl>,
## #   `4/17/20` <dbl>, `4/18/20` <dbl>, `4/19/20` <dbl>, `4/20/20` <dbl>,
## #   `4/21/20` <dbl>, `4/22/20` <dbl>, `4/23/20` <dbl>, `4/24/20` <dbl>,
## #   `4/25/20` <dbl>, `4/26/20` <dbl>, `4/27/20` <dbl>, `4/28/20` <dbl>,
## #   `4/29/20` <dbl>, `4/30/20` <dbl>, `05/01/2020` <dbl>, `05/02/2020` <dbl>,
## #   `05/03/2020` <dbl>, …

The reason we are looking at the wide (not long) data is now hopefully clear - it's easier for us to see what's going on in the ‘Province.State’ and ‘Country.Region’ columns when they aren't repeated over and over again! So we can see that Australia is helpfully split up into 8 different states and territories.


Task

Now filter the WB data to only show the data from Australia, and compare this to the covid data. What are the differences?










## the data for Australia from the WB
pop_2019 %>% filter(country == "Australia")
## # A tibble: 1 x 9
##   iso2c iso3c country    date SP.POP.TOTL unit  obs_status footnote last_updated
##   <chr> <chr> <chr>     <dbl>       <dbl> <chr> <chr>      <chr>    <date>      
## 1 AU    AUS   Australia  2020          NA <NA>  <NA>       <NA>     2020-10-15

You can see these data aren’t split by state or province. So, before we can join these data frames we need to convert the covid data so that we have the total covid deaths for a country as a whole, not split up by region.

We can do this trivially using the tidyverse and the summarise() function. summarise() allows us to specify what function we want to be applied to data (e.g. mean(), sd(), or in our case sum()) and will create a new column of a specified name containing that data.

We can pair this up with the group_by() function, where we can specify the groups within which we want the summarise function to be applied. Together they give us a very powerful tool for rapidly and easily summarising complex data (i.e. data with lots of groups). It is also very easily read by humans (which is very helpful for sharing code!):

## make a new data.frame from the old covid_long data.frame
covid_country <- covid_long %>% 
                  ## we want to calculate the number of 
                  ##deaths in each country and at each date:
                  group_by(Country.Region, Date) %>% 
                  ## and we want the sum of the "Death" column in these groups:
                  summarise(Deaths = sum(Deaths))
## `summarise()` regrouping output by 'Country.Region' (override with `.groups` argument)
## have a look at the data.frame that is produced:
covid_country
## # A tibble: 33,088 x 3
## # Groups:   Country.Region [188]
##    Country.Region Date       Deaths
##    <chr>          <chr>       <dbl>
##  1 Afghanistan    02/01/2020      0
##  2 Afghanistan    02/02/2020      0
##  3 Afghanistan    02/03/2020      0
##  4 Afghanistan    02/04/2020      0
##  5 Afghanistan    02/05/2020      0
##  6 Afghanistan    02/06/2020      0
##  7 Afghanistan    02/07/2020      0
##  8 Afghanistan    02/08/2020      0
##  9 Afghanistan    02/09/2020      0
## 10 Afghanistan    02/10/2020      0
## # … with 33,078 more rows

So that has sorted the first of our problems - we have removed the regions so the death toll is per country. Fantastic. Now for the second problem: the country names. This issue has been recognised (not just in R) for a long time - it's the same issue airports have (how exactly do you spell Guarulhos – Governador André Franco Montoro International Airport?). For airports we have a three-letter code which specifies every airport in the world (the IATA airport code, in case you care), and the country-naming problem uses the same solution. You can see it in the WB data (the iso3c column, which includes economic areas and countries):

##look at the last rows of the WB data:
tail(pop_2019)
## # A tibble: 6 x 9
##   iso2c iso3c country    date SP.POP.TOTL unit  obs_status footnote last_updated
##   <chr> <chr> <chr>     <dbl>       <dbl> <chr> <chr>      <chr>    <date>      
## 1 VN    VNM   Vietnam    2020          NA <NA>  <NA>       <NA>     2020-10-15  
## 2 VI    VIR   Virgin I…  2020          NA <NA>  <NA>       <NA>     2020-10-15  
## 3 PS    PSE   West Ban…  2020          NA <NA>  <NA>       <NA>     2020-10-15  
## 4 YE    YEM   Yemen, R…  2020          NA <NA>  <NA>       <NA>     2020-10-15  
## 5 ZM    ZMB   Zambia     2020          NA <NA>  <NA>       <NA>     2020-10-15  
## 6 ZW    ZWE   Zimbabwe   2020          NA <NA>  <NA>       <NA>     2020-10-15

So what we need to do is find these 3-letter codes for all the countries in our covid data set (covering all possible spelling variations of the country names). This seems like an insurmountable challenge, but luckily someone has made an R package to do just that!


Task

Install the countrycode package (from CRAN), and look at the help files for the countrycode() function. See if you can implement the countrycode() function using the data in covid_country, and then add these country codes to the covid_country data set. Remember you can add a column to a data.frame using the $ operator.


## add a column to the covid data containing the country codes
covid_country$code <- countrycode()













Did you manage to get that to work? Hopefully so. Here is the code I used:

## add a column to the covid data containing the country codes
covid_country$code <- countrycode(covid_country$Country.Region, 
                                  origin = "country.name", 
                                  destination = "iso3c")

We have a warning message thrown up:

Some values were not matched unambiguously: Diamond Princess, Kosovo, MS Zaandam .

We can ignore those (unless we are particularly interested in those regions - two of them are cruise ships, not countries!).
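If you do want to see exactly which rows were left without a code, a filter() on is.na() will list them. Here is a sketch on a toy stand-in for the data (the tibble below is invented; on the real covid_country you would filter the code column in the same way):

```r
library(dplyr)

## a toy stand-in for covid_country after adding codes -
## regions with no ISO match end up with an NA code
toy <- tibble(Country.Region = c("Afghanistan", "Diamond Princess", "Kosovo"),
              code           = c("AFG", NA, NA))

## list each region with no ISO code, once
toy %>% filter(is.na(code)) %>% distinct(Country.Region)
```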

Again, let's have a quick visual check to see if it has worked as we expected.


Task

Look at the top row of the data frame.


##look at the new column we have added to the data set:
head(covid_country, 1)
## # A tibble: 1 x 4
## # Groups:   Country.Region [1]
##   Country.Region Date       Deaths code 
##   <chr>          <chr>       <dbl> <chr>
## 1 Afghanistan    02/01/2020      0 AFG
##compare that to the values in the WB data
pop_2019 %>% filter(iso3c == "AFG")
## # A tibble: 1 x 9
##   iso2c iso3c country    date SP.POP.TOTL unit  obs_status footnote last_updated
##   <chr> <chr> <chr>     <dbl>       <dbl> <chr> <chr>      <chr>    <date>      
## 1 AF    AFG   Afghanis…  2020          NA <NA>  <NA>       <NA>     2020-10-15

Great, looks good!

2.4 Joining data

So now we have sorted the covid data out, we can join the population data from the WB to it. You came across the _join() set of functions during your homework last week (I used full_join() to make the final data set before pivoting it).

We need to join the data again here, but we don’t want to do a full_join().


Question

Why would we not want to use full_join()? And can you identify the correct _join() function to use to achieve what we want?

See tidyverse website for more information.






OK, so looking at the documentation shows us full_join() keeps all of the columns AND rows for both x and y - but remember we have countries in the covid data and countries AND economic areas in the WB data, so we want to drop these economic areas and just stick to the countries.

We can do this using left_join() (which returns all rows from x, and all columns from x and y). Remember we want to join the WB data to our cleaned data set covid_country via our new code column and the corresponding column of codes in the WB data (make sure to select the correct column of codes as there are a few!).

We also don’t really want ALL of the columns in the WB data, so we can use the select() function to specify we only want the iso3c and value columns to be included in our join:

##rename the 5th column so it works with the following code
names(pop_2019)[5] <- "value"

##demonstration of what select does:
head(pop_2019 %>% select(iso3c, value))
## # A tibble: 6 x 2
##   iso3c value
##   <chr> <dbl>
## 1 AFG      NA
## 2 ALB      NA
## 3 DZA      NA
## 4 ASM      NA
## 5 AND      NA
## 6 AGO      NA
## join the two data sets:
covid_w_pop <- left_join(covid_country, 
                         pop_2019 %>% select(iso3c, value),
                         by = c("code" = "iso3c"))

##look at the new data set
covid_w_pop
## # A tibble: 33,088 x 5
## # Groups:   Country.Region [188]
##    Country.Region Date       Deaths code  value
##    <chr>          <chr>       <dbl> <chr> <dbl>
##  1 Afghanistan    02/01/2020      0 AFG      NA
##  2 Afghanistan    02/02/2020      0 AFG      NA
##  3 Afghanistan    02/03/2020      0 AFG      NA
##  4 Afghanistan    02/04/2020      0 AFG      NA
##  5 Afghanistan    02/05/2020      0 AFG      NA
##  6 Afghanistan    02/06/2020      0 AFG      NA
##  7 Afghanistan    02/07/2020      0 AFG      NA
##  8 Afghanistan    02/08/2020      0 AFG      NA
##  9 Afghanistan    02/09/2020      0 AFG      NA
## 10 Afghanistan    02/10/2020      0 AFG      NA
## # … with 33,078 more rows

Note - here we are telling _join() that "code" = "iso3c", as otherwise _join() won't work out where to join the data sets. You can join by multiple columns too: for example, we could have joined these data by country and by date using something like by = c("code" = "country", "date" = "Date"). We didn't do that here because we had already filtered the correct year out of the WB data, because the dates in the covid data are in month/day/year format (so we would need to extract the year first), and because the most recent year is 2020 in the covid data but only 2019 in the WB data. You can see why we did it this way instead!
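To see what a multi-column join looks like in practice, here is a sketch on two invented tibbles (the column names mimic our data, but the values and the structure are made up for illustration):

```r
library(dplyr)

## invented example data - two tables keyed by country code AND year
deaths <- tibble(code   = c("AFG", "AFG"),
                 year   = c(2019, 2020),
                 deaths = c(0, 10))
pops   <- tibble(iso3c = c("AFG", "AFG"),
                 date  = c(2019, 2020),
                 pop   = c(38e6, 39e6))

## each row is matched on both keys at once
left_join(deaths, pops, by = c("code" = "iso3c", "year" = "date"))
```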

We will then change the name of the “value” column to one which is more meaningful. We can do this using the which() function, which takes a logical statement (in this case names(covid_w_pop) == "value") and returns the positions in the vector where the statement is TRUE:

## column names
names(covid_w_pop)
## [1] "Country.Region" "Date"           "Deaths"         "code"          
## [5] "value"
##the ones which are equal to "value"
names(covid_w_pop) == "value"
## [1] FALSE FALSE FALSE FALSE  TRUE
##the position in the vector of the "TRUE" statements
which(names(covid_w_pop) == "value")
## [1] 5

This means you don’t have to specify the location of the “value” column manually (i.e. you don’t need to write names(covid_w_pop)[5]), so if it ever changes position this code will still work.

##change the name
names(covid_w_pop)[which(names(covid_w_pop) == "value")] <- "Population"
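An alternative worth knowing is dplyr's rename(), which matches the column by name rather than by position, so it is robust in the same way. A sketch on a toy tibble (invented data standing in for covid_w_pop):

```r
library(dplyr)

## a toy tibble standing in for covid_w_pop
toy <- tibble(code = "AFG", value = 100)

## rename(new_name = old_name) - matched by name, not position
toy <- toy %>% rename(Population = value)
names(toy)
```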

As another quick visual check, filter out a single country from each data set and check that the population data are the same in each:

## quick visual check
covid_w_pop %>% filter(Country.Region=="Afghanistan" & Date == "1/22/20") 
## # A tibble: 1 x 5
## # Groups:   Country.Region [1]
##   Country.Region Date    Deaths code  Population
##   <chr>          <chr>    <dbl> <chr>      <dbl>
## 1 Afghanistan    1/22/20      0 AFG           NA
pop_2019 %>% filter(country=="Afghanistan")
## # A tibble: 1 x 9
##   iso2c iso3c country      date value unit  obs_status footnote last_updated
##   <chr> <chr> <chr>       <dbl> <dbl> <chr> <chr>      <chr>    <date>      
## 1 AF    AFG   Afghanistan  2020    NA <NA>  <NA>       <NA>     2020-10-15

Now we have joined the data together and done some cleaning, we can start to calculate the statistics we want and then start to think about visualising them.

2.5 Calculating death rates

So we said we might be interested in the following statistics:

  1. the total number of deaths (globally)
  2. the number of deaths per day (globally)
  3. the number of deaths per million people in each country

The data are the cumulative number of deaths per country per day, which makes some of these statistics a little tricky to calculate (like the total number of deaths).

2.5.1 Total global deaths

Let's start off with the total number of global deaths. We know the data are cumulative, so the total deaths will be the sum of the data for all countries on the most recent date in the data frame. You can extract this using filter() as we did earlier.


Task

Have a go at that now - think about how to find the latest date (we did this earlier), then filter by that date to get a data.frame which only has those values in it, and then calculate the number of global deaths from that.






So your code will look something like:

##filter to leave the most recent data
most_recent <- covid_country %>% 
                  filter(Date == max(covid_country$Date))

##sum the deaths
sum(most_recent$Deaths)
## [1] 584124

2.6 Number of deaths per day globally

Next let's consider the number of deaths per day globally. Again we can draw on functions we learned earlier. Hint - think about how we need to group the data to calculate this…


Task

Calculate the number of deaths per day globally and save it as a new data.frame.


##make a new data.frame
global_deaths_day <- ?




So hopefully you realised that you will need to use the group_by() function here, but that this is again a trivial calculation:

## make a new data frame of the global deaths using group_by() and summarise()
global_deaths_day <- covid_country %>% 
                          group_by(Date) %>%
                          summarise("Global.deaths" = sum(Deaths))
## `summarise()` ungrouping output (override with `.groups` argument)

Let's make a quick check to ensure we don't have any issues in the data (like NAs):

which(is.na(global_deaths_day$Global.deaths))
## integer(0)

Great, it doesn't seem like we have any NAs (otherwise the calculation above would return a series of numbers corresponding to the rows in which those NA values are). We can trivially remove any NAs in our data using the na.rm = TRUE argument:

## make a new data frame of the global deaths using group_by() and summarise()
global_deaths_day <- covid_country %>% 
                          group_by(Date) %>%
                          summarise("Global.deaths" = sum(Deaths, na.rm=T))
## `summarise()` ungrouping output (override with `.groups` argument)

2.7 Deaths per million people

The final statistic we were interested in calculating was the number of deaths per million people. This normalised death rate allows us to compare the rate of increase in deaths between countries which have vastly different population sizes. For this we need to go back to our covid_w_pop data set which after our manipulations earlier has all the data we need.


Task

calculate deaths per million individuals for all of the countries in the covid_w_pop data.






Hopefully you realise that we don't need to do anything complicated here because we have already done all the data sorting earlier on. This just leaves a simple vectorised calculation using the deaths and population of each country:

##calculate deaths per million people and add it to the data.frame
covid_w_pop$Deaths.p.m <- (covid_w_pop$Deaths / covid_w_pop$Population) * 1000000

##look at the data
tail(covid_w_pop)
## # A tibble: 6 x 6
## # Groups:   Country.Region [1]
##   Country.Region Date    Deaths code  Population Deaths.p.m
##   <chr>          <chr>    <dbl> <chr>      <dbl>      <dbl>
## 1 Zimbabwe       6/28/20      6 ZWE           NA         NA
## 2 Zimbabwe       6/29/20      7 ZWE           NA         NA
## 3 Zimbabwe       6/30/20      7 ZWE           NA         NA
## 4 Zimbabwe       7/13/20     19 ZWE           NA         NA
## 5 Zimbabwe       7/14/20     20 ZWE           NA         NA
## 6 Zimbabwe       7/15/20     20 ZWE           NA         NA
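If you prefer to stay within a pipeline, the same column can be added with mutate() from dplyr; this is a sketch assuming the same covid_w_pop object as above:

```r
## equivalent calculation using mutate()
covid_w_pop <- covid_w_pop %>%
  mutate(Deaths.p.m = (Deaths / Population) * 1000000)
```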

2.8 Quick summary

So we have learned how to reshape and join data frames, as well as clean the data and do some calculations for sub-groups in that data. We have produced two data frames from this: global_deaths_day and covid_w_pop. Next we will start to visualise these data using another tidyverse package: ggplot.

3 Visualising data with ggplot

There are a whole host of methods for making plots in R - in fact making brilliant visualisations is one of the best things about the R language. There are simple methods (like plot() in base R) and there are horribly complex methods, but the most flexible and widely used is ggplot. Its beauty lies in its flexibility and automatic handling of complex data, and once you have learned ggplot you won't have to learn another way of plotting your data in R.

Some of the reasons to use ggplot (stolen from the old ggplot wiki):

  • Automatic legends, colors, etc.

  • The “default” output is much nicer than with base graphics, unless starting from the output of a statistical analysis, where people have written nice things in base.

  • Easy two-colour gradients to distinguish positive and negative values in GAM surfaces.

  • Easy access to ribbons with transparency, for confidence intervals.

  • Store any ggplot2 object for modification or future recall.

  • easy superposition (boxplot + points + lines + …)

  • easy facetting

  • easy legend

  • …etc

Because ggplot is part of the tidyverse you’ve already installed it and loaded it into R. Let’s get started.

3.1 Basic plots using ggplot()

Let’s start off by doing a simple plot using the relatively simple data of the global deaths per day we made earlier. Because ggplot is part of the tidyverse it uses a similar syntax to, say, pivot_() or _join() in the sense that we stack arguments up to create finished chunks of code.

The first building block of ggplot is the ggplot() function itself, which makes a ggplot object. We use ggplot() to pass in the data we want to plot and to set some basic aesthetics of the plot. Aesthetics are one of the main tools we use in ggplot to define what data we want to plot and how we want those data to be treated. Aesthetic arguments are passed using aes() inside the ggplot() call, as below:

## make a ggplot object
ggplot(data = global_deaths_day, aes(x = Date, y = Global.deaths))

So here we have told ggplot() we want to use the global_deaths_day data and that the x-axis will be the date column in that data, and the y-axis will be the deaths column in that data.frame.

You will have noticed one thing straight away - we haven't actually plotted anything! (And the x-axis is horrible, but we will solve that later.) All we have made here is the ggplot object (that's what the function ggplot() does), but we haven't told ggplot what kind of plot we want.

I almost always start plotting data with a scatter plot, as this gives us an easy way of visually checking the data (and doesn’t fail in some of the interesting ways line plots can fail in, which we will run into later).

To make a scatter plot we simply add the function geom_point() to the ggplot object we created. The geom_...() functions specify what kind of plot ggplot() should make (see The power of geom_...() for more on this).

So, for a scatter plot:

## make a ggplot object
ggplot(data = global_deaths_day, aes(x = Date, y = Global.deaths)) + 
  ##add a geom to that object (in this case geom_point)
  ## notice the + after the ggplot() argument which allows us to 
  ##split this over multiple lines
  geom_point()

So, you can see something very strange is going on here - the data don’t look at all how we would expect them to (remember this is CUMULATIVE deaths - so it can’t go down!). This is an example of why visualising your data is so important - as humans we are really bad at assessing numbers, but if we present something visually we can (often quicker than a computer doing the same task) pick out issues.

So what is going on here? Well the clue lies in that horrible x-axis in the plot, and also the structure of our data. Have a look again at the data we are trying to plot:

global_deaths_day
## # A tibble: 176 x 2
##    Date       Global.deaths
##    <chr>              <dbl>
##  1 02/01/2020           259
##  2 02/02/2020           362
##  3 02/03/2020           426
##  4 02/04/2020           492
##  5 02/05/2020           564
##  6 02/06/2020           634
##  7 02/07/2020           719
##  8 02/08/2020           806
##  9 02/09/2020           906
## 10 02/10/2020          1013
## # … with 166 more rows

We have two columns - “Global.deaths” (which we can see is a numeric dbl) and “Date”, which is a character vector - and this is where our issue is being generated. R is treating the dates as characters (well, fair enough, we haven't told it not to) and is plotting the values in character (alphabetical) order. Of course this isn't the same order as the dates fall in the real world (the format of the dates in our data frame is month/day/year - the American format) - and this explains all of our odd-looking data. We need to tell R that the column is full of dates. This is also why ggplot() has plotted every single value of the “Date” column - it doesn't know how to simplify these data and so presents all of them as if they were a categorical variable. We can solve these issues using as.Date() from base R:

## tell R that the data is a date. 
##We need to specify the format the date data are given in using "%m/%d/%y" 
##(see ?as.Date for help on what this means)
global_deaths_day$Date.corrected <- as.Date(global_deaths_day$Date, format = "%m/%d/%y")

Now if we try to plot these data again (using our new column of corrected dates) we get the following:

## make a ggplot object
ggplot(data = global_deaths_day, aes(x = Date.corrected, y = Global.deaths)) + geom_point()

Great, so that is what the data actually look like - visually much better!

3.2 The power of geom_...()

Understanding the power of ggplot means understanding the role of two key components - geoms, and how ggplot deals with grouped data. Let's consider the geom bit first.

In the first example we specified that we wanted a geom_point() plot. Because the geom_...() is given as a separate argument after we specify the data and aesthetics, we can easily swap between geom_point() and another type of geom_...() without recoding anything but this one argument. This might sound trivial but it's pretty revolutionary - especially when plots become complicated. It also means we can stack geoms - assuming they can be plotted on the same axes:

## a scatter plot
ggplot(data = global_deaths_day, aes(x = Date.corrected, y = Global.deaths)) + 
  ## points
  geom_point()

## a line plot
ggplot(data = global_deaths_day, aes(x = Date.corrected, y = Global.deaths)) + 
  ## lines
  geom_line()

## a scatter and line plot
ggplot(data = global_deaths_day, aes(x = Date.corrected, y = Global.deaths)) + 
  ## points
  geom_point(col = "darkgrey") +
  ## and lines! 
  geom_line(col = "red") 

Note - There are two things to note in the above. (1) you can manually change the colour of the lines or points using col = (more on colours below), and (2) ggplot will layer the lines/points/etc in the order you specify them - i.e. above there are points with a line plotted over the top because we used geom_point() and then geom_line() in the code.

There are a vast number of geoms available (see here for a complete list) which allow you to plot many different kinds of graphic (histogram, barplot, map, polygon, point, dotplot, violin plot, etc.), and the way ggplot is set up means that once you have mastered the basic syntax it opens up a huge range of potential plots without much additional coding to do or learn.

3.3 Splitting up your ggplot arguments

Complex plots are facilitated by splitting your ggplot() arguments across multiple statements. You can do this by saving the ggplot objects as objects in R:

##make the ggplot object
p1 <- ggplot(data = global_deaths_day, aes(x = Date.corrected, y = Global.deaths)) 

## add the graphic (in this case lines)
p1 <- p1 + geom_line()

##plot it
p1

And then you can manipulate this object, for example testing how a graphic might look, but not overwriting it:

##try p1 with points too
p1 + geom_point()

##but see that p1 hasn't been altered, because we didn't overwrite it using "p1 <-"
p1

Alternatively you can easily make 2 versions of the same graphic which are a little different, and compare how they look:

##make the ggplot object
p1 <- ggplot(data = global_deaths_day, aes(x = Date.corrected, y = Global.deaths)) 
## add the graphic (in this case lines)
p1 <- p1 + geom_line()
##make a second plot where you have points too, saved to a new object so you don't overwrite p1:
p2 <- p1 + geom_point()
##compare the two
p1

p2

3.4 Grouped data

Perhaps the most powerful aspect of ggplot is the way it deals with grouped data (after all, most data aren't as simple as what we plotted above). This is because, beneath the code you see and use, it accesses all the tools of the tidyverse and so can group and segregate data really efficiently for plotting.

For this we need to move to our more complex data set - the covid_w_pop:

covid_w_pop
## # A tibble: 33,088 x 6
## # Groups:   Country.Region [188]
##    Country.Region Date       Deaths code  Population Deaths.p.m
##    <chr>          <chr>       <dbl> <chr>      <dbl>      <dbl>
##  1 Afghanistan    02/01/2020      0 AFG           NA         NA
##  2 Afghanistan    02/02/2020      0 AFG           NA         NA
##  3 Afghanistan    02/03/2020      0 AFG           NA         NA
##  4 Afghanistan    02/04/2020      0 AFG           NA         NA
##  5 Afghanistan    02/05/2020      0 AFG           NA         NA
##  6 Afghanistan    02/06/2020      0 AFG           NA         NA
##  7 Afghanistan    02/07/2020      0 AFG           NA         NA
##  8 Afghanistan    02/08/2020      0 AFG           NA         NA
##  9 Afghanistan    02/09/2020      0 AFG           NA         NA
## 10 Afghanistan    02/10/2020      0 AFG           NA         NA
## # … with 33,078 more rows

So in this instance we want to plot the data by country so we can visualise how the virus has spread over time in different places across the world. There are a couple of ways of doing this in ggplot():

3.4.1 Aesthetic grouping

3.4.1.1 Colors

The first is to set some of the aesthetics of the plot to reflect the fact that there are groups within the data. First we need to sort out the date information for this data.frame too.


Task

Create a new column in the covid_w_pop data which contains the date in the correct format.


covid_w_pop$Date.corrected <- as.Date(covid_w_pop$Date, format = "%m/%d/%y")

Then let's make a new ggplot object:

##make the ggplot object
by_country <- ggplot(data = covid_w_pop, aes(x = Date.corrected, y = Deaths)) 

Then we want to add points to this coloured by the different countries. ggplot() can do this automatically using another aesthetic argument - col. We used col above to specify a single colour for a set of points/lines, but if we put it inside the aesthetics argument aes() then we can make ggplot automatically assign colours - remember specifying columns via aes() means ggplot looks in our data for a column matching that name, like it did for the x and y variables. So below we tell ggplot to look for a column called Country.Region in our data frame, and then assign a different colour (col) to each of the groups of data in Country.Region:

##make the ggplot object
by_country + geom_point(aes(col = Country.Region))

You’ll notice that you can’t actually see this plot because the legend is so big (because we have so many levels in the country data). We can suppress the legend with an argument to a new ggplot function called theme(). theme() allows us to change many of the visual aspects of a ggplot that aren't to do with the data being plotted (e.g. the presence of lines in the background):

##make the ggplot object
by_country + geom_point(aes(col = Country.Region)) + theme(legend.position = "none")

Better! Now you can see how powerful ggplot can be with really complex data. A very simple argument aes(col = Country.Region) has made a plot with all data from all countries plotted together.

It’s all very messy though, as we are plotting 180 or so countries’ data onto a single plot. Instead let’s just look at a selection of a few countries which might be of interest.

##make a vector of countries we want to look at:
selec_countries <- c("United Kingdom", "China", "US", "Italy", "France", "Germany")

##use this to filter the data for our plot, here using the %in% operator:
sel_country_plot <- ggplot(data = covid_w_pop %>% 
                             filter(Country.Region %in% selec_countries), 
                           aes(x = Date.corrected, y = Deaths)) 

##add a line geom specifying that the colours are dependent on the groups in `Country.Region`
sel_country_plot + geom_line(aes(col=Country.Region))
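If %in% is new to you, here is a tiny standalone illustration of what it returns - a logical vector with one TRUE/FALSE per element of the left-hand vector:

```r
## which elements of the left-hand vector appear in the right-hand vector?
c("United Kingdom", "Narnia", "China") %in% c("United Kingdom", "China", "US")
## [1]  TRUE FALSE  TRUE
```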

It's worth noting two things here:

  1. when you use aes() in any of the functions of ggplot() - e.g. geom_line() - it will look in the data specified when setting up the original ggplot() object - i.e. you don't have to tell it to look in that data each time you add a new function.

  2. this is what happens when you try to plot a line plot that has groups of data, but you don't tell ggplot() that it has groups of data:

## with no grouping
sel_country_plot + geom_line()

ggplot() will plot the data, but there will be a single line connecting all of the data points (this is why I tend to plot data with geom_point() for initial visualisation).
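If you want one line per country without mapping colours, ggplot's group aesthetic (a standard aes() argument) tells geom_line() where the groups are; a sketch using the same sel_country_plot object:

```r
## one line per country, all drawn in the default colour
sel_country_plot + geom_line(aes(group = Country.Region))
```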

3.4.1.2 Point/line types

Another, slightly more subdued, option for plotting grouped data is to alter the aesthetics of the points or lines plotted for each country. Again, ggplot makes this easy:

##set line type by country
sel_country_plot + geom_line(aes(linetype = Country.Region))

##set point shape by country
sel_country_plot + geom_point(aes(shape = Country.Region))

3.4.2 Faceting

The other major way to plot grouped data is to spread the data from different groups across different sub-plots (called facets). Again, ggplot() can do this automatically to save us a lot of trouble. We create these multiple plots using either facet_wrap() or facet_grid(), telling ggplot which grouping we want to facet by. Both can take up to two grouping variables (and will make a grid of plots). facet_grid(), as you might expect, forces the plots onto a grid whose rows and columns are determined by the groups, whilst facet_wrap() wraps the plots into a grid but doesn't tie their positions to the groups (this sounds complicated, but isn't, and is much easier to understand once you are using them!).

Because you aren’t specifying an aesthetic you don’t need the aes() argument for the facet_...() functions, but you do need to use the ~ operator - which in R speak means “as a function of”. So we want to facet our data as a function of Country.Region. If you have two groups to facet by, you specify them as facet_wrap(x ~ y). In our case we only have one, so we replace the x with a .:

##facet the data by country:
sel_country_plot + geom_line() + facet_wrap(. ~ Country.Region)

You’ll again see that ggplot() does some nice things automatically - for example it gives the facets shared axes to save space and look good. And of course you have probably worked out by now we can stack arguments together:

##facet the data by country:
sel_country_plot + geom_line(aes(col = Country.Region)) + facet_wrap(. ~ Country.Region)
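For comparison, if we had a second grouping variable we could facet by both at once. The Continent column below is hypothetical (our data don't contain it) - it is just to show the two-variable syntax:

```r
## hypothetical two-way faceting: rows by Continent, columns by Country.Region
sel_country_plot + geom_line() + facet_grid(Continent ~ Country.Region)
```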

3.5 Saving plots

So we have made some basic plots - how do we now save them so we can use them in a publication?

Well, the best way is to save the plot out as a pdf, using (surprise!) the pdf() function. In this we specify where we want the pdf to be saved, and the height and width (given in inches; see ?pdf for the other arguments). The pdf() function is a little unusual in that we run it to start R making a pdf, then we run our code which prints our plot, and then we use dev.off() to stop the pdf() function and finish making the pdf:

##specify the directory and name of the pdf, and the width and height
pdf("~/Desktop/Work/Biofinformatics/Plots/Deaths by country.pdf", width = 6, height = 4)

  ##run your code to print your plot
  sel_country_plot + 
    ##add lines
    geom_line(aes(col = Country.Region)) + 
    ##add facets
    facet_wrap(. ~ Country.Region)

##stop the pdf function and finish the .pdf file
dev.off()

The nice thing about making figure outputs with the pdf() function is that if you make any updates to your figure, they are automatically saved to the .pdf when you re-run the code in your script.
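As an alternative to the pdf()/dev.off() pair, ggplot2 also provides ggsave(), which saves a plot object (or by default the last plot drawn) in a single call; a sketch, with the file name made up:

```r
## save a ggplot object directly; the file format is taken from the extension
ggsave("deaths_by_country.pdf",
       plot = sel_country_plot + geom_line(aes(col = Country.Region)),
       width = 6, height = 4)
```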

4 Key Concepts

Let’s quickly summarise what we have learned today:

  • Covered more data manipulation (e.g. filter, summarise)
  • Learned to identify issues in data (e.g. country names) and sort those issues out
  • Learned how to join data sets by an ID column (or more than one)
  • Learned to summarise data (e.g. total deaths) and to do this for each group in a data set using group_by()
  • Learned to do basic data visualisations with ggplot() and identify some common plotting issues
  • Learned about geoms and how they can be used to make different sorts of plots
  • Learned how to plot grouped data on the same plot (using different colours, shapes, and line types) and on different plots (facets)
  • Learned how to save your plot

5 Functions covered today

Function/operator                                        Call
Join data sets                                           _join()
Make a data frame into a tibble                          as_tibble()
Filter out certain data                                  filter()
Apply tidyverse functions to grouped data                group_by()
Extract summary data (e.g. mean)                         summarise()
Find and set country code data                           countrycode()
Show the positions of TRUE values in a logical vector    which()
Add together values in a vector                          sum()
Make a ggplot object                                     ggplot()
Make a scatter plot                                      geom_point()
Make a line plot                                         geom_line()
Adjust theme of a ggplot object                          theme()
Plot grouped data on different plots                     facet_wrap()
Make a .pdf file of a plot                               pdf()

6 Cheat sheets

Data manipulation with tidyverse

ggplot

Patchwork - Very useful extension to ggplot() allowing you to easily combine multiple figures together

7 Homework

We will continue our exploration of the data set on abundances you manipulated for homework last week. You've already learned how to join and reshape the data, now let's try visualising it. So for homework this week you need to:

  • Visualise the time series data

Which doesn’t sound like a big deal, but there are some intricacies to it. This homework is about visualising the data in the most useful way possible - i.e. getting the most out of the data visually. To do this you will need to think about the data set as a whole - what are the patterns here? What are the best ways to plot the data? Facets, or colours, or linetypes? This is nuanced and there isn’t a right answer, but you will run into problems with the plotting because of some things in the data you won’t have spotted yet. This task is about considering what story you want to tell and what is the most important thing to visualise - just like publishing your real data. Try and make as finished a plot as you can (use the ggplot cheatsheet (above) for help on making the plots look nicer if you need it). This excellent Github repository might help you choose colors.

There will also be a prize for the best plot (as judged by me!), and any particularly good plots will be shared around the group.

Good luck!